For all R users, CRAN is the place where you can find most of the packages you need to run code. Last month, I saw this xkcd drawing:
It reminded me of something: behind packages, there are people. These people are at the heart of all your analytics pipelines because they build, maintain and develop R packages. Their ideas, their work, their discussions and their collaborations have made R what it is. One example is the fantastic story of the pipe (%>%) by Adolfo Alvarez.
Let’s talk a bit about these people and their packages!
Not all packages on CRAN are equal. They are all useful and they all required a great deal of work. But some are more useful than others, partly because they are more general.
I try to assess this by comparing two things: how often a package is downloaded, and how it is connected to other packages through dependencies.
For the first one, I used {cranlogs}, a package which shows the download statistics for the RStudio CRAN mirror, from 2013 to now. These are not all the CRAN downloads, but they are a good share of them. I chose to compute the median monthly downloads for each package since its first appearance (so only for months with more than 0 downloads). I used the median rather than the mean because some packages show strange statistics: {tidyverse} had an incredible number of downloads in November 2018, and {aws.s3} had a large number of downloads over the last year that I couldn’t explain.
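To make the choice of the median concrete, here is a minimal sketch in Python, using made-up daily counts for one hypothetical package. The function name `median_monthly_downloads` and the sample figures are assumptions for illustration only. They are not taken from {cranlogs} or from the actual data. The sketch sums daily counts by month, drops months with no downloads, and takes the median of what remains.

```python
from collections import defaultdict
from statistics import mean, median


def median_monthly_downloads(daily_counts):
    """Median of monthly download totals, ignoring months with no downloads.

    daily_counts: iterable of (iso_date, downloads), e.g. ("2018-11-03", 120).
    """
    monthly = defaultdict(int)
    for day, count in daily_counts:
        monthly[day[:7]] += count  # group by the "YYYY-MM" prefix
    # Keep only months where the package was actually downloaded.
    active = [total for total in monthly.values() if total > 0]
    return median(active) if active else 0


# Made-up counts for a hypothetical package, with a spike in November.
sample = [
    ("2018-08-10", 0),                            # before the first release
    ("2018-09-05", 40), ("2018-09-20", 60),       # September total: 100
    ("2018-10-11", 120),                          # October total: 120
    ("2018-11-02", 5000), ("2018-11-15", 4000),   # November total: 9000
    ("2018-12-01", 110),                          # December total: 110
]

result = median_monthly_downloads(sample)
naive_mean = mean([100, 120, 9000, 110])
print(result, naive_mean)
```

With these toy numbers the median stays close to a typical month, while the mean is pulled far upwards by the single November spike.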
For the dependency count, I used some centrality measures from graph theory. What matters to me is how many times the package is listed as a dependency (in-degree centrality), how many dependencies it has (out-degree centrality) and whether the package is “important” in the dependency graph as a whole. For this last measure, I used PageRank centrality, the algorithm Google used to rank web pages.
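The three measures can be illustrated on a toy dependency graph. The sketch below is a hedged example in Python, not the code used for this analysis. The package names (`app_a`, `app_b`, `mid`, `core`), the function name `degree_and_pagerank`, the damping factor of 0.85 and the fixed number of iterations are all assumptions chosen for illustration. Edges point from a package to the packages it depends on, so rank flows towards dependencies.

```python
def degree_and_pagerank(dependencies, damping=0.85, iterations=100):
    """Return (in_degree, out_degree, pagerank) for a dependency graph.

    dependencies maps each package to the list of packages it depends on.
    """
    nodes = set(dependencies)
    for deps in dependencies.values():
        nodes.update(deps)
    nodes = sorted(nodes)
    edges = {pkg: set(dependencies.get(pkg, [])) for pkg in nodes}

    # Out-degree: how many dependencies a package has.
    out_degree = {pkg: len(edges[pkg]) for pkg in nodes}
    # In-degree: how many times a package is listed as a dependency.
    in_degree = {pkg: 0 for pkg in nodes}
    for pkg in nodes:
        for dep in edges[pkg]:
            in_degree[dep] += 1

    # PageRank by power iteration; packages without dependencies
    # spread their rank evenly over every node.
    n = len(nodes)
    rank = {pkg: 1.0 / n for pkg in nodes}
    for _ in range(iterations):
        dangling = sum(rank[pkg] for pkg in nodes if out_degree[pkg] == 0)
        new_rank = {pkg: (1 - damping) / n + damping * dangling / n
                    for pkg in nodes}
        for pkg in nodes:
            for dep in edges[pkg]:
                new_rank[dep] += damping * rank[pkg] / out_degree[pkg]
        rank = new_rank
    return in_degree, out_degree, rank


# Hypothetical packages: two user-facing ones, one intermediate, one root.
toy_graph = {
    "app_a": ["mid", "core"],
    "app_b": ["mid"],
    "mid": ["core"],
    "core": [],
}

in_degree, out_degree, rank = degree_and_pagerank(toy_graph)
for pkg in sorted(rank, key=rank.get, reverse=True):
    print(pkg, in_degree[pkg], out_degree[pkg], round(rank[pkg], 3))
```

In this toy graph `core` and `mid` have the same in-degree, but `core` ends up with the highest PageRank because `mid` passes its own rank on to it. This is the sense in which PageRank captures importance in the whole graph rather than only direct dependencies.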
With all of that, we get the following table with all the packages on CRAN, ranked by PageRank centrality:
I gathered the main information in a graph that plots centrality (in-degree) against popularity (downloads). I annotated a few main areas:
On the right of the graph are many very popular packages like {dplyr} or {ggplot2} (the whole {tidyverse} collection, in fact) or {httr}. These packages are downloaded directly by users. They are widely used, and a lot of packages depend on them.
On the top-left are packages that are less familiar to newcomers, like {ellipsis}, {vctrs}, {pillar}, etc. These packages have a lot of downloads but are listed as a dependency by only a few, very important, packages. They are really “infrastructure” packages: they are the foundations of more widely used packages and they hide the lower-level functions. So we can guess that their downloads are not direct downloads but come from the installation of other packages.
Some other packages, in the middle between the two categories above, like {rlang} or {Rcpp}, could be the most important R packages. Note also that they have an out-degree centrality of 0, so they depend on no other package (they are root packages).
What can we learn from this little piece of work? First, collaboration is important in the R world! People design packages together.